This paper develops a statistical learning approach to identify potentially new
high-temperature ferroelectric piezoelectric perovskite compounds. Unlike most computational
studies on crystal chemistry, where the starting point is some form of electronic structure 
calculation, we use a data-driven approach to initiate our search. This is accomplished 
by identifying patterns of behaviour between discrete scalar descriptors associated with 
crystal and electronic structure and the reported Curie temperature (Tc) of known
compounds; extracting design rules that govern critical structure-property relationships; 
and discovering in a quantitative fashion the exact role of these materials descriptors. Our 
approach applies linear manifold methods for data dimensionality reduction to discover 
the dominant descriptors governing structure-property correlations (the 'genes') and 
Shannon entropy metrics coupled to recursive partitioning methods to quantitatively 
assess the specific combination of descriptors that govern the link between crystal 
chemistry and Tc (their 'sequencing'). We use this information to develop predictive
models that can suggest new structure/chemistries and/or properties. In this manner,
BiTmO3-PbTiO3 and BiLuO3-PbTiO3 are predicted to have a Tc of 730°C and 705°C,
respectively. A quantitative structure-property relationship model similar to those used 
in biology and drug discovery not only predicts our new chemistries but also validates 
published reports. 

Keywords: inorganic gene; high-temperature piezoelectrics; statistical learning; 
information theory; data-driven modelling 



1. Introduction 

Through many seminal papers, Alan Mackay has expounded on the idea of a
framework for 'Generalized Crystallography' (Mackay 1966, 1974, 1977, 1986).
He has proposed that 'the crystal is a structure, the description of which is much
smaller than the structure itself' and that this description of structure serves as
a 'carrier of information' about the structure on larger length scales (Mackay
2002). He went on to suggest that these components of description of structure
can help develop a 'biological approach to inorganic systems' and proposed 
the construction of an 'inorganic gene'. This paradigm motivates the present
study, which explores how fundamental pieces of information, treated as discrete
bits of data, can collectively characterize the stability and properties of a given
crystal chemistry. We show how statistical learning tools, including fundamental
concepts borrowed from information theory, can characterize a crystal structure
in terms of fundamental descriptors of information (i.e. the 'genes'), and how
these pieces of information interact or are 'sequenced' to determine the
characteristics of that crystal structure and to guide the development of new
crystal chemistries and targeted physical properties.

The challenge in defining the 'gene' in inorganic crystal chemistry is to 
characterize the appropriate combination of discrete characteristics associated 
with crystal chemistry that collectively define a particular property or set of 
properties of the material. Normally, structure-property relationships are guided 
by defined functional relationships (e.g. electronic structure calculations to define 
energy landscapes associated with crystal chemistry). However, we propose an 
approach to establish such a structure-property relationship where we do not 
assume any specific formulation linking structure with property (Johannesson
et al. 2002; Curtarolo et al. 2003; Woodley et al. 2004; Dudiy & Zunger 2006;
Fischer et al. 2006; Sluiter 2007; Mohn & Kob 2009; Oganov & Valle 2009).
Rather, we take a data-driven approach where we seek to establish structure- 
property relationships by identifying patterns of behaviour between known 
discrete scalar descriptors associated with crystal and electronic structure and 
observed properties of the material. From this, we extract design rules that allow
us to systematically identify critical structure-property relationships and to
quantify the role of the specific combinations of materials descriptors (i.e. genes)
that govern a given property. This is the foundation of the concept of the
quantitative structure-activity (or property) relationship (QSAR/QSPR) widely used
in the field of organic chemistry and drug discovery. The mathematical underpinning
of developing a QSPR-type relationship is statistical learning (a term encompassing
a broad range of tools derived from statistics, data mining and machine learning).
In our group, we have applied this approach to explore a variety of questions
associated with crystal chemistry (Suh & Rajan 2005, 2009; Gadzuric et al. 2006;
Rajagopalan & Rajan 2007; George et al. 2009; Broderick et al. 2010; Rajan 2010;
Zenasni et al. 2010). In this paper, we demonstrate that, by using the QSPR
concept and the tools of statistical inference, the discrete bits of information
that define a robust QSPR can be sequenced to help identify new materials with
new and targeted properties. The specific
objective of the present study is identifying, through the sole use of statistical 
learning methods, new high-temperature piezoelectric ferroelectrics. However, 
this paper also serves as a generic template for an information science-based 
materials discovery and design strategy, in the spirit of Mackay's proposition of 
an inorganic gene.

2. Background

(a) Materials chemistry of high-temperature piezoelectrics 

Historically, the design of materials chemistry for high-temperature piezoelectric
behaviour has been guided by an apparent linear relationship between
Goldschmidt's tolerance factor (t) and Curie temperature (Tc) at the
morphotropic phase boundary (MPB) composition of the PbTiO3 (PT)-based
end-member solid solutions (Eitel et al. 2001; Duan et al. 2004). However, the
use of the tolerance factor as a 'figure of merit' has had limited impact in
developing or identifying new materials via experiment (Eitel et al. 2001; Duan
et al. 2004) or computation (Baettig et al. 2005), because it
captures only a very limited set of variables (i.e. ionic radii) describing a given
perovskite crystal chemistry (Thomas 1997). The motivation of our work is to
find alternative computation-based methods that can help to refine the chemical
search space and identify potentially new and promising piezoelectric materials
for high-temperature applications.
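
As a brief illustration of the figure of merit discussed above, Goldschmidt's tolerance factor for an ABO3 perovskite is t = (r_A + r_O) / (sqrt(2) (r_B + r_O)), where r_A, r_B and r_O are the ionic radii of the A-site cation, the B-site cation and oxygen. The sketch below evaluates it in Python. The function name and the radii are illustrative assumptions (approximate Shannon-type values), not data taken from this study.

```python
import math

def tolerance_factor(r_a: float, r_b: float, r_o: float = 1.40) -> float:
    """Goldschmidt tolerance factor t = (r_A + r_O) / (sqrt(2) * (r_B + r_O))."""
    return (r_a + r_o) / (math.sqrt(2.0) * (r_b + r_o))

# Illustrative radii in angstroms (assumed values; verify against Shannon's tables).
r_pb = 1.49    # Pb2+, 12-coordinate (assumption)
r_ti = 0.605   # Ti4+, 6-coordinate (assumption)

t_pbtio3 = tolerance_factor(r_pb, r_ti)
print(f"t(PbTiO3) = {t_pbtio3:.3f}")
```

A value of t close to 1 corresponds to an ideal geometric fit of the A-site cation in the octahedral framework. Because t depends only on ionic radii, it cannot distinguish cations of similar size but different bonding character, which is the limitation noted above.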

The chemical search space of known and predicted perovskite-based 
ferroelectric compounds in BiMeO3-PbTiO3 solid solution is mapped in figure 1,
where Me is a single cation with charge 3+ or a combination of two different
cations (Me1/2Me1/2, Me2/3Me1/3 and Me3/4Me1/4) with an average charge 3+,
occupying the octahedral site of the perovskite lattice (Eitel et al. 2001;
Grinberg et al. 2005; Suchomel & Davies 2005; Stein et al. 2006; Grinberg &
Rappe 2007). The solid solutions were classified based on the chemical origin
of ferroelectric instability caused by Me cations. The distinction between
strong (filled red circles) and weak (filled green squares) ferroelectric activity
was made based on the degree of off-centring tendency of Me cations in
MeO6 octahedra. Clearly, the search space is sparse in the high-temperature
region, and our goal is to explore the vast combinatorial search space and
identify new high-temperature piezoelectric chemistries. In this work, we
have focused primarily on identifying a new Me3+ cation that satisfies the
following conditions: 

— it must show weak ferroelectric activity; 

— BiMeO3 must have a stable perovskite structure at ambient or non-ambient
(high-pressure/-temperature) conditions; and

— the resulting BiMeO3-PbTiO3 solid solution should have a high Tc.

We explore a data-driven methodology that applies statistical learning tools
to analyse correlations between numerous scalar descriptors of electronic and
crystal structure parameters of known perovskite piezoelectric compounds. That
information is used in turn to develop predictive models that can suggest new
structure/chemistries and/or properties based purely on the formalism of
statistical learning methods. This methodology is quite different from the
widely reported approach in which large numbers of high-throughput electronic
structure computations are conducted to seek compound chemistries with energy
minima (where data mining-related techniques are embedded in the computation to
improve the efficiency of the calculations), and potentially new stable compounds
are then identified as those that have energy minima but are not reported in known
experimental databases (Johannesson et al. 2002; Curtarolo et al. 2003; Woodley
et al. 2004; Dudiy & Zunger 2006; Fischer et al. 2006; Sluiter 2007; Mohn & Kob
2009; Oganov & Valle 2009).

Figure 1. Curie temperature (Tc) of known and predicted perovskite-based ferroelectric
compounds in the chemical space of BiMeO3-PbTiO3 solid solution, where Me is a single
cation with charge 3+ (e.g. Al, Sc, In, etc.) or a combination of two different cations
Me1/2Me1/2 (e.g. ZnTi, ZnZr, ZnSn, etc.), Me2/3Me1/3 (e.g. ZnNb, MgNb) and Me3/4Me1/4
(e.g. ZnW, MgW, ScGa) with an average charge 3+ and that occupies the octahedral site of
the perovskite lattice (Eitel et al. 2001; Grinberg et al. 2005; Suchomel & Davies 2005;
Stein et al. 2006; Grinberg & Rappe 2007). The target design space represents the
high-temperature regime that is of interest to us; the chemical search space is sparse in
this region, with only three compounds identified. For reference, the Tc of the
PbZrO3-PbTiO3 solid solution is also indicated. Our objective is to systematically explore
the complex chemical search space and identify potentially new piezoelectric materials
that have high Tc. In this article, we report our computational work, where we have
focused particularly on identifying a suitable Me3+ cation (which is weakly
ferroelectrically active and occupies the octahedral site of the perovskite lattice) that
can significantly enhance the Tc of BiMeO3-PbTiO3 solid solution. The distinction between
strong and weak ferroelectric activity was made based on the degree of off-centring
tendency of Me cations in MeO6 octahedra. Filled circles, Me cations that show strong
ferroelectric activity; filled squares, Me cations that show weak ferroelectric activity;
filled triangles, Me cations that show strong and weak ferroelectric activity. (Online
version in colour.)

Our approach requires carefully establishing a dataset of descriptors
on which we directly apply statistical learning tools. The number of parameters
needed to predict even relatively simple structures can be large if one
has to capture both geometrical and bonding characteristics of that crystal
chemistry. One of the arguments we put forward in this paper
is that although the potential number of variables can in fact be large, data
dimensionality reduction and information theoretic techniques can help reduce
it to a manageable number. This paper describes a data mining strategy from
which effective classification and predictive models can be developed using
high-dimensional information.

Figure 2. (a) A network of corner-sharing BO6 octahedra with a large A-site cation occupying
the interstitial position. (b) The simplified unit-cell representation of cubic perovskite
without showing coordination. (c) The geometry of the building units, AO12 cuboctahedra and
BO6 octahedra, with 12-coordinated A-site and 6-coordinated B-site, respectively. The description
of the crystal structure in the form of structural building units presents a number of diverse choices
to develop new descriptors based on the site chemistry and coordination. (Online version in colour.)



(b) Defining the chemical search space 

The search for new high-temperature piezoelectric materials by chemical 
modification of PbTiO3 perovskite at both Pb and Ti sites has been an area
of considerable interest in the last decade (Saghi-Szabo et al. 1998; Eitel
et al. 2001). While there are many crystal structures that may be suitable
for high-temperature piezoelectric application, such as perovskites, langasites
(Damjanovic 1998) and perovskite-like layered structures (Yan et al. 2009), we
are interested in perovskites because they have the best combination of high
temperature and piezoelectric properties compared with other structures, and
many perovskites are also ferroelectrics, which can be used as piezoelectric
materials when poled (Cohen 2008; Rödel et al. 2009). The crystal structure
of an ideal perovskite crystal is shown in figure 2. Following the discovery of the
crucial role of Bi in enhancing the ferroelectric properties in PbTiO3 (Íñiguez
et al. 2003), numerous experimental and theoretical studies focusing on
BiMeO3-PbTiO3 solid solutions were carried out (where Me represents a single cation
with charge 3+ or a combination of cations with an average charge 3+) with the
further objective of identifying a potential Me cation that can maximize both
Curie temperature and ferroelectric properties of the solid solution (Suchomel &
Davies 2004, 2005; Grinberg et al. 2005; Stein et al. 2006; Stringer et al. 2006;
Chen et al. 2007, 2009; Grinberg & Rappe 2007). The key findings from the earlier
studies are summarized below:

— Enhancement of ferroelectric properties and Curie temperature owing to
the presence of strongly ferroelectrically active Me cations (e.g. Ti4+,
Zn2+, Fe3+, etc.). These strongly ferroelectrically active Me cations cause
hybridization of Me-O bonds in MeO6 octahedra, leading to distortions
resulting in significant ionic displacement from the ideal position (Cohen
1992, 2008; Rödel et al. 2009). The ionic displacements were responsible
for enhanced polarization and ferroelectric properties. Some examples of
compounds with strongly ferroelectrically active Me cations are
BiFeO3-PbTiO3 and Bi(ZnTi)O3-PbTiO3.

— On the other hand, it was found that the presence of weakly
ferroelectrically active Me cations (e.g. Sc3+, Mg2+ and Yb3+) can
also enhance the high-temperature ferroelectric properties. In this case,
the Me cations do not lead to hybridization of Me-O bonds; instead,
the steric effect causes the Pb/Bi cation to avoid the larger Me/Ti
cation owing to the larger wave-function overlap (and therefore stronger
Pauli repulsion) and move towards the smaller cation. The stronger
repulsion leads to increased Pb/Bi cation displacement, which in turn
results in enhanced ferroelectric behaviour (Grinberg et al. 2005). Some
examples of compounds with weakly ferroelectrically active Me cations
are BiScO3-PbTiO3 and BiYbO3-PbTiO3.

Our chemical search space is defined in electronic supplementary material, 
figure S1, and we have focused particularly on identifying a suitable BiMeO3
perovskite end member, where Me is a single cation that is weakly ferroelectrically
active with a formal charge 3+ and that can form a solid solution with PbTiO3
at ambient conditions. 



3. Statistical learning computational strategy 

(a) Introduction to tolerance factor-Tc model 

Eitel et al. (2001) first discovered the existence of an apparent linear 
relationship between tolerance factor of ABO3 end-member compositions and
Curie temperature at MPB for a large number of ABO3-PbTiO3 solid solutions,
although there was some significant scatter (figure 3). Grinberg et al. (2005)
later addressed this scatter by identifying that the data fall into two clusters,
and they showed that both clusters exhibited a linear dependence of Curie
temperature on the end-member tolerance factor but had different slopes. The
physical reasons behind the two slopes were correlated to the differences in
the ferroelectric activity of various B-site cations of the ABO3 end-member
compositions. While both models can be applied to quantitatively predict the Tc,
neither predicts the perovskite phase stability of the ABO3-PbTiO3 solid solution.
This is a major shortcoming because only those ABO3-PbTiO3 solid solutions
that form a pure perovskite phase at ambient conditions are technologically useful 
(Grinberg et al. 2005).

Figure 3. The univariate tolerance factor-Tc model of Eitel et al. (2001). The
shortcomings of the univariate tolerance factor-Tc model are clearly noticeable, as the data show
significant scatter owing to the presence of two clusters of compounds with different physics. This
indicates that the tolerance factor is only a necessary condition and not sufficient for modelling Tc.
We have addressed the shortcomings of the tolerance factor-Tc model by developing a multivariate
model that considers six key crystal chemical descriptors instead of only the tolerance factor.
Notation for chemical compounds and parameters is described in the electronic supplementary
material. (Online version in colour.)



We have addressed the above-mentioned shortcomings of the
tolerance factor-Tc model in two ways. Firstly, by considering additional
crystal chemical descriptors, a reasonably accurate multivariate model was
developed (described in §4b) using linear manifold methods for quantitatively
predicting the Tc at MPB of ABO3-PbTiO3 solid solutions. To reduce the scatter,
instead of including all ferroelectric ABO3-PbTiO3 chemistries that contain both
strongly and weakly ferroelectrically active cations, we have typically considered
end members that belong to Pb(B1B2)O3 and BiMeO3 perovskites, where B1,
B2 and Me are cations that occupy the octahedral site of the perovskite lattice
and the Me cation is weakly ferroelectrically active. By clearly defining our chemical
search space in this manner, we focus on the relevant physics that best describes
our objective.

Secondly, in order to determine the perovskite phase stability of the
ABO3-PbTiO3 solid solution, we have developed an independent classification model
based on information theory concepts (e.g. Shannon entropy) that tracks which
combination of parameters influences the perovskite structural stability by
partitioning a high-dimensional dataset. As noted by Karnani et al. (2009),
natural data structures, such as genomes, books, file systems and data servers, are
repositories of information that share common characteristics. They also display
skewed distributions and hierarchical organization, which certainly applies to
crystallographic data. The physical representation of information allows us to
understand that these ubiquitous characteristics are consequences of the second
law of thermodynamics. Thus, by combining the linear manifold methods with the
information theory concepts, we can identify new high-temperature piezoelectric
materials.
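
To make the information-theoretic idea concrete, the following sketch computes the Shannon entropy of a set of class labels and the information gain of a single threshold split, which is the basic step that entropy-based recursive partitioning repeats on each subset. It is a generic illustration, not the classification model of this study (whose mathematics is given in the electronic supplementary material). The function names, the toy descriptor values and the labels are all hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class frequencies in `labels` (bits)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the samples at `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    children = (len(left) / n) * shannon_entropy(left) \
             + (len(right) / n) * shannon_entropy(right)
    return shannon_entropy(labels) - children

def best_split(values, labels):
    """Scan midpoints between sorted unique values; return (threshold, gain)."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(((c, information_gain(values, labels, c)) for c in candidates),
               key=lambda pair: pair[1])

# Hypothetical toy data: one descriptor and a two-class structure label.
descriptor = [0.80, 0.84, 0.88, 0.93, 0.97, 1.01]
label = ["non-perovskite"] * 3 + ["perovskite"] * 3

threshold, gain = best_split(descriptor, label)
print(f"best threshold = {threshold:.3f}, information gain = {gain:.3f} bits")
```

In a full recursive partitioning scheme the same search would run over every descriptor, the data would be divided at the best split, and the procedure would be repeated on each branch until a stopping rule is met.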

(b) Informatics-based computational strategy 

Our computational logic for designing new high-temperature piezoelectric
chemistries is summarized in the form of a flow chart in the electronic
supplementary material, figure S2. The logic involves three steps.

(i) Identification of a relevant descriptor set that fully describes the
high-temperature behaviour of ABO3 perovskites. Thirty attributes were screened
using principal component analysis (PCA), and a reduced set of six key attributes
was identified that showed high correlation with the transition temperature.

(ii) Development of a robust multivariate model using partial least squares (PLS)
that predicts Tc at MPB of ABO3-PbTiO3 solid solutions. By applying the PLS model,
new candidate chemistries were identified that are suitable for high-temperature
applications.

(iii) Screening for piezoelectric behaviour in the new candidate chemistries
by testing the perovskite structural stability of ABO3 end members. For this
purpose, new classification models were developed using a recursive partitioning
strategy. The outcome of this analysis is important for determining whether it is
possible to synthesize a pure perovskite phase in the ABO3-PbTiO3 solid solution.
Only those ABO3 end members that were classified to have a stable perovskite
structure-type by recursive partitioning were chosen and identified as potential
high-temperature piezoelectric materials.

The mathematics of PCA, PLS and recursive partitioning in the context of our
specific datasets is summarized in the electronic supplementary material.

Before elaborating on the data mining methods, we need to address the
obvious concern that, at first glance, the statistical learning methods do not
themselves explicitly solve the energy minimization problem that physics-based
calculations do. This concern is addressed in two ways. First, we are searching
for a high-dimensional correlation between attributes of compounds that already
exist and hence are by definition stable. A corollary to this point is that,
mathematically, we are using convex optimization methods that help to ensure we
have a global minimum (Izenman 2008). Second, we test the validity of our models
with respect to the target materials property (i.e. Curie temperature in this
case) by using well-established and robust methods to check that they reproduce
the known data, which gives us statistical confidence in the models we develop.

4. Results and discussion 

(a) Identifying the relevant descriptor set: the inorganic genes 

As noted above, the tolerance factor as the sole figure of merit to design new
high-temperature piezoelectric perovskite compounds appears to be insufficient.
To look beyond the tolerance factor to predict new high-temperature piezoelectric
materials, we have surveyed over 30 different attributes (table 1) associated with
crystal geometry, bonding, thermodynamics and electronic structure of 22 simple
ABO3 perovskite chemistries with known transition temperatures (Shannon 1976;
Matsui & Nomura 1981; Saxena 1993; Emsley 1998; Brown 2002; Suh & Rajan
2005; Goudochnikov & Bell 2007; Grinberg & Rappe 2007; Makov et al. 2009;
Pettersson et al. 2009; Rajan 2010). The transition temperature of an ABO3
compound is defined as the temperature at which the crystal structure of ABO3
changes from low symmetry to the highest possible symmetry. While not all of
the ABO3 compounds assessed are ferroelectric, the objective of this work is
unaffected, since the final goal is to suggest new perovskite-based end members
forming solid solutions with PT. Alloying an ABO3 perovskite compound with
PbTiO3 has the potential to lead to a high piezoelectric characteristic in the
resulting ABO3-PbTiO3 ceramic (Grinberg & Rappe 2004).

Table 1. Enumeration of the 30 descriptors used in the principal component analysis (PCA) for
identifying the relevant inorganic gene. The underlying rationale behind
choosing these different attributes associated with crystal geometry, bonding, thermodynamics
and electronic structure was to fully describe the crystal chemistry of perovskite-based compounds
that is relevant for modelling the ferroelectric behaviour, and the search was motivated by the past
experimental and theoretical work of Abrahams et al. (1968), Igarashi et al. (1987), Singh et al.
(1988), Ravez et al. (1997), Goudochnikov & Bell (2007) and Grinberg & Rappe (2007).

abbreviation          description

r_A (Å)               Shannon's (1976) ionic radii of A-site (12-coordination)
r_B (Å)               Shannon's ionic radii of B-site (6-coordination)
t                     tolerance factor calculated using ionic radii
d_A-O (Å)             ideal A-O bond distance (Brese & O'Keeffe 1991)
d_B-O (Å)             ideal B-O bond distance
t_BV                  tolerance factor calculated using d_A-O and d_B-O
A_EA (kJ mol^-1)      A-site electron affinity (Hotop & Lineberger 1985)
A_EFF-S               A-site effective nuclear charge, Slater scale (Slater 1930)
A_EFF-C               A-site effective nuclear charge, Clementi scale (Clementi & Raimondi 1963)
A_EFF-F               A-site effective nuclear charge, Froese-Fischer scale (Froese-Fischer 1972)
B_EFF-S               B-site effective nuclear charge, Slater scale
B_EFF-C               B-site effective nuclear charge, Clementi scale
B_EFF-F               B-site effective nuclear charge, Froese-Fischer scale
A_WS (Å)              A-site Wigner-Seitz cell radius (Skriver 2004)
B_WS (Å)              B-site Wigner-Seitz cell radius
A_EN-P                A-site electronegativity, Pauling scale (Pauling 1960)
A_EN-AR               A-site electronegativity, Allred-Rochow scale (Allred & Rochow 1958)
A_EN (eV)             A-site electronegativity, absolute scale (Pearson 1988)
B_EN-P                B-site electronegativity, Pauling scale
B_EN-AR               B-site electronegativity, Allred-Rochow scale
B_EN (eV)             B-site electronegativity, absolute scale
D_A (Å)               ionic displacement (Grinberg & Rappe 2007) of A-site
D_B (Å)               ionic displacement of B-site
ΔH_AO (J mol^-1)      enthalpy of formation (Saxena 1993) of A oxide
ΔH_BO (J mol^-1)      enthalpy of formation of B oxide
ΔH_ABO3 (J mol^-1)    enthalpy of formation of ABO3
a (Å)                 lattice constant (Matsui & Nomura 1981)
b (Å)                 lattice constant
c (Å)                 lattice constant
V/Z (Å^3)             volume of unit cell/coordination number
Tt (K)                transition temperature

Figure 4. Loadings plot between PC1 and PC2 showing the interactions of 30 descriptors captured
by PCA. The degree of correlation between the target variable and the other
attributes is established from the angle between them. Two zones are marked in the figure that show a strong correlation with the
target variable (Tt): the red zone (with stripes) signifies attributes that show positive correlation
with Tt and the green zone (no stripes) signifies variables that show negative correlation with Tt.
The abbreviations of the attributes are provided in table 1. (Online version in colour.)

To identify the complex relationships between physical properties and
crystal chemistry and geometry from the existing knowledge base, PCA is
employed (Ericksson et al. 2001; Rajan 2005; Ringner 2008). The input
$X = \{x_1, x_2, x_3, \ldots, x_n\} \in \mathbb{R}^{n \times d}$ (where n = 22 and d = 30 denote the number of
ABO3 compounds and the number of physical attributes quantifying each
ABO3 compound, respectively) is initially preprocessed by mean-centring and
standardization. PCA reduces the dimensionality of the data by identifying new
latent variables (called principal components, PCs) that capture the largest
amount of variation in the data. Each PC is a linear combination of the weighted
contribution of each attribute. By comparing the magnitude and direction of
the weighted contribution from each attribute, the correlation structure in the
high-dimensional data is discovered.
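
The preprocessing and decomposition described above can be sketched in a few lines. The example below standardizes a small data matrix and extracts the first principal component by power iteration on the correlation matrix. It is a minimal pure-Python illustration: the function names and the 5 × 3 toy matrix are hypothetical, and power iteration is simply one convenient way to obtain the leading component, not necessarily the algorithm used in this study (where n = 22 and d = 30).

```python
import math

def standardize(rows):
    """Mean-centre each column and scale it to unit sample standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    # Assumes no column is constant (a zero standard deviation would divide by zero).
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def first_principal_component(z, iterations=500):
    """Return (loadings, explained_fraction) for the leading PC of standardized data."""
    n, d = len(z), len(z[0])
    # Correlation matrix of the standardized columns.
    corr = [[sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1) for b in range(d)]
            for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d                   # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, eigenvalue / d                       # the trace of corr equals d

# Hypothetical toy matrix: rows are compounds, columns are three descriptors.
toy = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 4.1],
    [3.0, 6.2, 2.9],
    [4.0, 7.8, 2.2],
    [5.0, 10.1, 0.8],
]
loadings, explained = first_principal_component(standardize(toy))
print("PC1 loadings:", [round(x, 3) for x in loadings])
print("fraction of variance explained:", round(explained, 3))
```

Loadings with the same sign on a component indicate attributes that vary together along it, and loadings of opposite sign indicate attributes that vary inversely, which is the information read off a loading plot.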

Figure 4 (referred to as a loading plot) shows the uncovered correlations
between the physical attributes for the first two PCs. The transition temperature
(Tt) is the target variable against which all correlations are computed. As we are
using linear manifold methods, we have employed Euclidean geometrical mapping
to help interpret these plots. The degree of correlation between any attribute and
Tt is determined by the cosine of the angle (θ) between the attribute and Tt (the angle
attribute-origin-Tt) within the loading plot. If θ = 0°, the attribute and
Tt are highly positively correlated; if θ = 180°, they are highly negatively
correlated; and if θ = 90°, there is no correlation between the attribute and Tt. In
figure 4, two zones that show the strongest correlation of the attributes with Tt
are explicitly marked, with the assumption that the first two PCs capture such a
high percentage of the data's information that the other PCs do not need to be
explicitly considered. The attributes r_B (ionic radii of B-site), d_B-O (ideal B-O
bond distance based on the bond-valence model), ΔH_BO (enthalpy of formation
of BO oxide) and b (lattice constant) correlate positively with Tt, while r_A (ionic
radii of A-site), d_A-O (ideal A-O bond distance based on the bond-valence model),
t (tolerance factor calculated using ionic radii), t_BV (tolerance factor calculated
using the bond-valence method), B_EN (B-site electronegativity, absolute scale),
B_EFF (B-site effective nuclear charge) and V/Z (volume of unit cell/coordination
number) correlate negatively with Tt. Our PCA model reproduces the well-known
inverse linear relationship between tolerance factor (t) and Tt. Based
on the removal of redundancy and consideration of available data, we have
determined that six attributes (r_A, t, B_EN, d_A-O, r_B and d_B-O) are appropriate
for describing Tt. By identifying these attributes, we can more fully describe
the high-temperature behaviour than is possible by only considering the tolerance
factor (t), and the selection of only the highly correlated attributes ensures the
robustness of the model.
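
The angle rule used to read the loading plot can be written out explicitly. In the sketch below each attribute is represented by its (PC1, PC2) loading coordinates, and its correlation with the target is approximated by the cosine of the angle between the two vectors drawn from the origin. The coordinates and attribute names are hypothetical placeholders, not values read from figure 4.

```python
import math

def loading_cosine(p, q):
    """Cosine of the angle between two loading vectors drawn from the origin."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Hypothetical (PC1, PC2) loadings for the target and three attributes.
target = (0.30, 0.10)
attributes = {
    "attr_pos": (0.28, 0.12),     # nearly parallel to the target
    "attr_neg": (-0.29, -0.08),   # nearly antiparallel to the target
    "attr_none": (-0.10, 0.30),   # perpendicular to the target
}

cosines = {name: loading_cosine(xy, target) for name, xy in attributes.items()}
for name, c in cosines.items():
    theta = math.degrees(math.acos(max(-1.0, min(1.0, c))))
    print(f"{name}: cos(theta) = {c:+.2f}, theta = {theta:.0f} deg")
```

This reading is only as reliable as the two-component approximation: it relies on the assumption, stated above, that the first two PCs capture most of the information in the data.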

(b) Identifying new high-temperature perovskites: developing a 'QSPR'

To test for high-Tc piezoelectric materials, we have applied PLS regression
(Ericksson et al. 2001) to predict Tc at the MPB of the end-member-PbTiO3
solid solution. PLS is particularly suitable for handling sparse data with strongly
correlated attributes. The piezoelectric materials database for predicting Tc as
a function of six attributes (r_A, t, B_EN, d_A-O, r_B and d_B-O) is taken from the
published work of Eitel et al. (2001) and Grinberg et al. (2005). This new QSPR
formulated using PLS is given by

\[
T_c = -789.912\,t - 153.932\,r_A + 1013.981\,r_B + 796.5864\,d_{\mathrm{B-O}}
      - 138.9\,d_{\mathrm{A-O}} - 55.6076\,B_{\mathrm{EN}} - 526.537.
\]
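
For readers who want to apply the relationship, it can be evaluated directly once the six descriptors are known. The sketch below hard-codes the coefficients quoted above. The dictionary keys and the function name are illustrative choices, and the code assumes that the coefficients act on unscaled descriptor values and that Tc is returned in the same units as the fitted data; both assumptions should be checked against the electronic supplementary material before use.

```python
# Coefficients of the linear QSPR as quoted in the text.
QSPR_COEFFICIENTS = {
    "t": -789.912,       # tolerance factor
    "r_A": -153.932,     # A-site ionic radius
    "r_B": 1013.981,     # B-site ionic radius
    "d_B-O": 796.5864,   # ideal B-O bond distance
    "d_A-O": -138.9,     # ideal A-O bond distance
    "B_EN": -55.6076,    # B-site electronegativity (absolute scale)
}
QSPR_INTERCEPT = -526.537

def predict_tc(descriptors):
    """Evaluate the linear QSPR for a mapping of descriptor name -> value."""
    return QSPR_INTERCEPT + sum(
        coefficient * descriptors[name]
        for name, coefficient in QSPR_COEFFICIENTS.items()
    )

# Sensitivity check with placeholder inputs (not real descriptor values):
# change only the tolerance factor by +0.01 and compare the two predictions.
baseline = {name: 0.0 for name in QSPR_COEFFICIENTS}
shifted = dict(baseline, t=0.01)
delta = predict_tc(shifted) - predict_tc(baseline)
print(f"change in predicted Tc for +0.01 in t: {delta:.3f}")
```

Because the model is linear, each coefficient is the change in predicted Tc per unit change in that descriptor with the others held fixed. The negative coefficient on t is consistent with the inverse relationship between tolerance factor and transition temperature noted earlier.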

Fifteen compounds were used for training the model and an independent set of
five compounds (not used during the training) was used for testing (figure 5).
Our QSPR model takes into account the physics of mismatch of bond lengths
(t), ionic size (r_A and r_B), bond lengths (d_A-O and d_B-O) and chemical bonding
at the B-site (B_EN), thereby accounting for a far greater diversity of attributes
in comparison to the previous model where only mismatch of bond lengths was
considered. Some of the descriptors captured in our QSPR model are also in the
original description of the tolerance factor. However, only two (r_B and r_A) of the
six descriptors are explicitly used in the tolerance factor formulation,
while the other four descriptors are not explicitly used. For end members that
had more than one cation in the octahedral site, such as Pb(B1B2)O3, we
considered the arithmetic mean value of B1 and B2. It should be noted, although
not elaborated in this paper, that the classification of Me ions into weakly
and strongly ferroelectric active species can be accomplished by exploring more
descriptors such as polarizability, ionic valence and ionic size.

Figure 5. Tc predicted by the multivariate model (abscissa) in comparison with the measured Tc as reported
in the literature (Eitel et al. 2001; Grinberg et al. 2005) for the PbTiO3 end members.
The model was developed by using 15 chemistries and tested on five chemistries. The new figure of
merit is Tc = -(789.912 × t) - (153.932 × r_A) + (1013.981 × r_B) + (796.5864 × d_B-O) - (138.9 ×
d_A-O) - (55.6076 × B_EN) - 526.537. Based on the new figure of merit, the Tc of the new piezoelectric
chemistries BiTmO3-PT and BiLuO3-PT were predicted to be 730°C and 705°C, respectively
(labelled red in the figure). It should be noted that the Tc of BiTmO3-PT and BiLuO3-PT
plotted in the figure is only the predicted value and needs to be experimentally validated. Notation
for chemical compounds and parameters is described in the electronic supplementary material.
Filled circles, training set; filled triangles, test set; plus symbols, new predictions. (Online version
in colour.)

The additional diversity of the QSPR model has a clear advantage over 
the model based solely on the tolerance factor. For many compounds, the QSPR 
model is in reasonable agreement with the tolerance factor model. However, in 
some cases, the mismatch of bond length is not sufficient for modelling the physics 
of the system. For the systems predicted here, BiLuO3-PbTiO3 is predicted to 
have a higher T_C than any system included in the training dataset; however, 
this result is not found when using the tolerance factor model. Therefore, we 
conclude that our QSPR is highly robust in predicting the T_C of 
unknown compounds (figure 5) and has broader significance when applied 
to new materials. Based on this QSPR model, a search of all the elements in the 
periodic table that best satisfy the correlation criterion involving the combination 
of attributes was performed. The search generated four new 
ABO3 chemistries (BiTmO3, BiLuO3, BiHoO3 and BiErO3) as potential high-T_C 
materials. Having identified the new chemistries, we then tested them for 
their crystal structure-type. 

(c) Screening for piezoelectric behaviour: 'sequencing the gene' 

To test for the perovskite structural stability, a new classification model was 
developed using a recursive partitioning strategy (Witten & Frank 2000; Hall 
et al. 2009) on a large database (taken from the work of Zhang et al. 2007 and 
references therein) of 355 ABO3 stoichiometric compounds (227 perovskites and 
128 non-perovskites) to track which combination of parameters influences the 
perovskite structural stability by partitioning a high-dimensional dataset. The 
outcome of this analysis is important for determining whether it is feasible to 
synthesize a pure perovskite phase in the BiBO3-PbTiO3 solid solution (where 
B = Tm, Lu, Ho, Er). Our hypothesis is that, if the BiTmO3, BiLuO3, BiHoO3 and BiErO3 
compounds are predicted to have a stable perovskite structure-type at ambient 
or non-ambient (high pressure/temperature) conditions, then it 
is possible to experimentally obtain a pure perovskite phase in the BiBO3-PbTiO3 
solid solution (where B = Tm, Lu, Ho, Er). Here, we explain the relevance of this 
hypothesis using a few examples based on experimental observations. 

It is well known that obtaining a pure Bi-based perovskite is difficult under 
conventional processing methods at ambient conditions. For example, a pure 
perovskite phase in BiScO3 is synthesized only at 6 GPa pressure and 1140°C 
temperature (Belik et al. 2006a,b), and in BiMnO3 a pure perovskite phase is 
obtained only at pressures greater than 4 GPa and 750°C temperature (Montanari 
et al. 2005). However, solid solutions of BiScO3-PbTiO3 (Zhang et al. 2003) 
and BiMnO3-PbTiO3 (Woodward & Reaney 2004) have been experimentally 
synthesized and are shown to have a pure perovskite phase. Even in the case 
of very low tolerance factor end members such as BiYbO3 (tolerance factor = 
0.857), there are experimental reports that confirm the limited solubility of 
BiYbO3 in PbTiO3. Feng et al. (2009), using conventional ceramic processing 
methods, synthesized a solid solution of 0.05BiYbO3-0.95PbTiO3 with the highest 
perovskite phase purity of 97.83 per cent. Obtaining a pure perovskite phase in 
BiYbO3 when synthesized at ambient conditions is extremely difficult (Drache 
et al. 2004), and we note that there is no experimental or theoretical study on 
structural phase transitions in BiYbO3 at high-pressure/-temperature conditions. 
In this work, using a recursive partitioning strategy, we have identified for the 
first time the existence of a stable perovskite structure-type in BiYbO3 at 
high-pressure/-temperature conditions, and this structural stability at 
high-pressure/-temperature conditions explains the limited solubility of BiYbO3 in 
PbTiO3 at ambient conditions. Alloying BiYbO3 with PbTiO3, which has a large 
c/a ratio, can help stabilize a perovskite phase by applying chemical pressure 
(Ahart et al. 2008). 

In this work, we apply our classification model to qualitatively determine the 
feasibility of synthesizing a pure perovskite phase in the BiBO3-PbTiO3 solid 
solution (where B = Tm, Lu, Ho, Er). In order to capture the physics of perovskite 
stability at high-pressure/-temperature conditions, we have included ABO3 
perovskite compounds such as BiScO3 (Belik et al. 2006a,b), BiMnO3 (Montanari 
et al. 2005), BiAlO3 (Belik et al. 2006a,b), NaSbO3 (Mizoguchi et al. 2004) and 
YInO3 (Shannon 1967) that are experimentally known to have a stable perovskite 
structure-type only at extreme pressure/temperature conditions. Therefore, 
the design rules that we extract from our classification model are applicable 
to identify new perovskites at both ambient and high-pressure/-temperature 
conditions. Using the Shannon entropy as a selection criterion, a hierarchical 
set of design rules was formulated to develop classification schemes that hitherto 
have been approached by empirical observation (Plenio & Vitelli 2001; Shell 2008; 
Karnani et al. 2009). 

The expected information required to classify an ABO3 compound solely based 
on its proportion in the database D is given by the Shannon entropy H(D), which 
is defined as 

$$
H(D) = -\sum_{i=1}^{m} p_i \log_2(p_i),
$$

where p_i is the probability that an arbitrary tuple in D belongs to class i 
(here, the perovskite crystal structure or not). A log function of base 2 is used 
because the information is encoded in bits, and m is the number of distinct classes 
(Han & Kamber 2006). We formulated our recursive partitioning as a binary 
classification problem, so m = 2. Further details on the construction and interpretation of 
the dendrogram are provided in the electronic supplementary material. 
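
As a worked illustration of the entropy definition, the sketch below evaluates H(D) for the class counts quoted earlier (227 perovskites and 128 non-perovskites out of 355 compounds). The function name is ours; only the counts come from the text.

```python
import math


def shannon_entropy(class_counts: list[int]) -> float:
    """Return H(D) in bits for a list of per-class tuple counts."""
    total = sum(class_counts)
    entropy = 0.0
    for count in class_counts:
        if count == 0:
            continue  # a class with p_i = 0 contributes nothing
        p_i = count / total
        entropy -= p_i * math.log2(p_i)
    return entropy


# Class counts quoted in the text: 227 perovskites, 128 non-perovskites.
h_database = shannon_entropy([227, 128])
print(f"H(D) = {h_database:.3f} bits")
```

Because the two classes are not equally populated, the result lies a little below the 1-bit maximum of a two-class problem; this is the baseline uncertainty that each split in the dendrogram is chosen to reduce.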

The aim of the classification is to track precisely which variables contribute to 
perovskite structural stability, and how. The output from a recursive 
partitioning analysis is a dendrogram (or a tree diagram) with branches grown on 
each node (attribute) to classify whether a particular ABO3 compound forms a 
perovskite crystal structure. The advantage of the recursive partitioning method 
is that it can efficiently model nonlinear relationships of arbitrary form, 
even when the attributes show strong interactions (Hawkins et al. 1997). Our 
recursive partitioning model classified 336 out of 355 compounds correctly (95% 
accuracy), and the model was validated by the standard 10-fold cross-validation 
technique used in statistics. 
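
To illustrate how an entropy-based criterion can select a split during recursive partitioning, the sketch below scores candidate thresholds on a single attribute by information gain (the reduction in Shannon entropy after the split). The helper names and the toy data are hypothetical, and the algorithm actually used for the dendrogram may differ in detail, for example in how it normalizes the gain or prunes the tree.

```python
import math
from typing import Optional, Sequence, Tuple


def entropy(labels: Sequence[bool]) -> float:
    """Shannon entropy (bits) of a binary label sequence."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_true = sum(labels) / n
    h = 0.0
    for p in (p_true, 1 - p_true):
        if p > 0:
            h -= p * math.log2(p)
    return h


def best_threshold(
    values: Sequence[float], labels: Sequence[bool]
) -> Tuple[Optional[float], float]:
    """Return (threshold, gain) maximising information gain for x > threshold."""
    base = entropy(labels)
    best: Tuple[Optional[float], float] = (None, 0.0)
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(ordered, ordered[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        weighted = (
            len(left) * entropy(left) + len(right) * entropy(right)
        ) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (thr, gain)
    return best


# Toy, made-up data: one attribute and a perovskite / non-perovskite label.
toy_values = [2.30, 2.35, 2.40, 2.50, 2.60, 2.70]
toy_labels = [False, False, False, True, True, True]
threshold, gain = best_threshold(toy_values, toy_labels)
```

Applying the same search to every attribute, keeping the best split and recursing on each resulting subset produces a tree of the kind described here.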

The dendrogram model used for predicting new perovskites is shown in figure 6. 
According to the dendrogram, d_A-O (ideal A-O bond length calculated based on 
the bond-valence method) is the most significant attribute impacting the phase 
stability of perovskite compounds, followed by the tolerance factor. The leaf 
nodes labelled 'yes' and 'no' indicate compounds that may have a stable 
perovskite structure-type or that are not perovskites, respectively. From the dendrogram, 
design rules were extracted for predicting new potentially stable perovskite 
compounds. Of the 227 perovskite compounds, 184 obeyed the following rule: if 
d_A-O > 2.453 and t_IR < 1.090863 and r_A/r_B > 1.509872 and B_EN-O_EN > 1.42 and 
r_A/r_B < 2.5625, then the ABO3 compound is a perovskite, where d_A-O is the 
ideal bond length based on the bond-valence model, t_IR is the tolerance factor 
calculated using ionic radii, r_A/r_B is the ionic radii ratio of A-site to B-site and 
B_EN-O_EN is the electronegativity difference (Pauling scale) between B-cation and 
O-anion. A total of 11 design rules were formulated for testing the perovskite 
structural stability. 
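
The quoted rule translates directly into a predicate. The sketch below is only an illustration: the function and parameter names are ours, the thresholds are copied from the rule, and the electronegativity difference is passed in as tabulated in the source database because its sign convention is not restated here. It encodes just one of the 11 rules, so a False result does not by itself mean that a compound is not a perovskite.

```python
def satisfies_main_rule(
    d_ao: float, t_ir: float, r_a: float, r_b: float, en_diff: float
) -> bool:
    """Check the single design rule obeyed by 184 of the 227 perovskites.

    en_diff is the B-O electronegativity difference exactly as tabulated
    in the source database; its sign convention is left to the caller.
    """
    ratio = r_a / r_b  # ionic radii ratio of A-site to B-site
    return (
        d_ao > 2.453
        and t_ir < 1.090863
        and 1.509872 < ratio < 2.5625
        and en_diff > 1.42
    )
```

For example, `satisfies_main_rule(2.6, 0.95, 1.4, 0.7, 1.8)` passes every condition (these inputs are made-up values, not database entries), whereas lowering `d_ao` to 2.4 fails the first test.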

By applying the dendrogram to the four candidate ABO3 compounds, 
only two compounds, BiTmO3 and BiLuO3, were identified as having a 
stable perovskite crystal structure at high-pressure/-temperature conditions. 

Figure 6. The dendrogram (or tree diagram) classification model developed based on the recursive 
partitioning method for identifying new potentially stable perovskite compounds is shown. We 
used the Shannon entropy as a selection criterion to identify key descriptors, and a hierarchical 
set of design rules was formulated to develop classification schemes that have been approached by 
empirical observation. The leaf nodes that are labelled 'yes' or 'no' indicate compounds that may 
have a stable perovskite structure-type or not a perovskite, respectively. From the dendrogram, 
11 design rules were formulated for testing the perovskite structural stability. By applying the 
dendrogram to the four candidate high-temperature materials BiErO3, BiHoO3, BiTmO3 and 
BiLuO3, only two compounds, BiTmO3 and BiLuO3, were identified as having the stable perovskite 
crystal structure at high-pressure/-temperature conditions. As a result, BiTmO3-PbTiO3 and 
BiLuO3-PbTiO3 solid solutions were identified as new perovskite compounds with a significantly 
high T_C while having piezoelectric behaviour. Applying the dendrogram to other Bi-based 
systems BiMeO3, where Me = Cr, Co, Ga and Ni, also identifies them as having the perovskite 
crystal structure, in agreement with the literature (Ishiwata et al. 2002; Baettig et al. 2005; Goujon 
et al. 2008; Oka et al. 2010). In the dendrogram, d_A-O is the ideal A-O bond length calculated 
based on the bond-valence method, t_IR is the tolerance factor from ionic radii data, r_A is the ionic radius 
(Shannon's scale) of the A-site cation with coordination number 12, r_B is the ionic radius (Shannon's 
scale) of the B-site cation with coordination number 6, B_EN-O_EN is the electronegativity difference 
(Pauling's scale) between B-site and O-site, A-ionicity is the product of r_A/r_O and A_EN-O_EN, 
B-ionicity is the product of r_B/r_O and B_EN-O_EN and GII is the global stability index (Zhang et al. 
2007). (Online version in colour.) 

Experimental synthesis of BiTmO3 and BiLuO3 compounds at ambient pressure 
has been attempted in the past but did not yield a pure 
perovskite phase (Drache et al. 2005); however, there are no data available on 
synthesizing BiTmO3 and BiLuO3 compounds at high-pressure/-temperature 
conditions. Therefore, we predict for the first time the existence of a stable 
perovskite phase in BiTmO3 and BiLuO3 compounds at high-pressure/-temperature 
conditions. This result indicates that Tm3+ (thulium) is the largest 
cation (with an ionic radius of 0.88 Å in sixfold coordination) that can occupy 
the octahedral site of a BiMeO3 perovskite lattice without impacting its phase 
stability. The dendrogram also predicts the existence of a stable perovskite 
phase in BiYbO3 at high-pressure/-temperature conditions. BiYbO3-PbTiO3 is 
known as a potential high-temperature piezoelectric material (Eitel et al. 2001; 
Feng et al. 2009), and there are experimental reports that confirm the limited 
solubility of BiYbO3 in PbTiO3, thereby forming a solid solution (Feng et al. 
2009). Thus, we conclude that it is possible to experimentally obtain a pure 
perovskite phase in BiLuO3-PbTiO3 and BiTmO3-PbTiO3 solid solutions. Based 
on the QSPR and the recursive partitioning model, two new perovskite end 
members were identified (BiTmO3-PbTiO3 and BiLuO3-PbTiO3) and predicted 
to have a high T_C of 730°C and 705°C at the MPB, respectively, while having 
piezoelectric behaviour. 

The focus of this report has been solely on identifying new BiMeO3-PbTiO3 
materials chemistries with higher Curie temperatures, where Me is a weakly 
ferroelectrically active cation with a formal charge of 3+. We fully realize that other 
electronic structure parameters such as polarizability, as well as microstructural 
parameters, play a critical role in defining a useful high-temperature piezoelectric 
material. Accounting for them involves exploring a larger and more diverse chemical space that 
includes more than one Me cation that is strongly ferroelectrically active. This work 
is presently under way, as is experimental verification of our results, and both 
will be reported in upcoming publications. 



5. Summary 

We have identified two new perovskite-based piezoelectric crystal chemistries, 
BiTmO3-PbTiO3 and BiLuO3-PbTiO3, with significantly higher Curie 
temperatures, using a highly efficient and robust computational strategy based on 
statistical learning and information theory concepts. The data mining strategy 
we have developed also permits us to identify key physical attributes that appear 
to govern the properties of a given crystal chemistry (e.g. piezoelectrics with a 
high Curie temperature), providing a mechanistic-based discovery process and 
not just a heuristic strategy. Finally, this paper helps to establish the efficacy 
of informatics as an approach to refine the chemical search space for materials 
discovery and hence to serve as a broader template for materials design in 
other applications. 